Please, read my outlines for the data visualization course in the the follow README file here and deep in the knowlege of this advantage.

Preface

When I started coding for biology I realize on this amazing challenge about how to tell the history from a bunch of Next Generation sequencing datasets. Visualization of information (from massive data mining in special) become in a nice part of my data scientist training.

Trainning dataset

We will use high throughput sequencing dataset from a marine non model organism exposed to hidrocarbon polutant. The libraries sequenced were acqured from undifferented (Un) and sexual differented stages ( Male and female). In this experiment time 0 and three corresponded to hidrocarbon pollutant expossition (before and after, respectivily).

Libraries were preprocessing using the standar parameters within trimmomatic and the assembly were performed with trinity. Differential gene expression and annotation were perform with edgeR (an R package) and Trinotate, respectivily.

The chunk-code here correspond to my workflow used in the lab. Hope you enjoy it!

Required package:

# library(devtools)
# devtools::install_github("rlbarter/superheat")
# install_github("cstubben/trinotateR")
# devtools::install_github("ropensci/taxa")
# devtools::install_github("grunwaldlab/metacoder")
# if (!require("DT")) install.packages('DT')
# source("https://bioconductor.org/biocLite.R"); biocLite("org.Hs.eg.db"); (~ 74.3 MB)

Loading input files

First lets

dir <- c("/Users/cigom/Documents/GitHub/July_2018_bioinfo/infile/")
setwd(dir)
# x <- read.table(paste0(dir,"diffExpr.P1e-2_C1.matrix"))

And read the input file; then let’s make a head(x) from the file:

T0F3 T0F4 T0M1 T0M4 T0UR T0U
TRINITY_DN29981_c0_g1_i1 158.162 152.974 0.537 0.000 0.000 0.000
TRINITY_DN25178_c2_g1_i1 0.337 0.181 167.292 100.095 140.610 925.271
TRINITY_DN23469_c2_g2_i1 0.000 0.000 24.217 0.000 312.915 3789.992
TRINITY_DN27384_c1_g1_i1 54.064 150.841 0.000 0.000 0.000 0.095
TRINITY_DN32136_c1_g1_i2 0.000 0.000 34.300 3.078 48.656 93.654
TRINITY_DN24700_c1_g1_i3 0.000 0.000 2.190 0.000 50.995 2477.124

Let’s log2 transform the data

data = x # restore before doing various data transformations
data = log2(data+1)
data = as.matrix(data) # convert to matrix
data = t(scale(t(data), scale=F)) # Centering rows

Heatmap from Differential gene Expression

Using superheat

library(superheat)
superheat(data,
          # retain original order of rows/cols
          pretty.order.rows = TRUE,
          pretty.order.cols = TRUE,
          row.dendrogram = TRUE,
          left.label = "none",
          bottom.label.text.angle = 0,
          row.title = "Differential Expressed",
          column.title = "Samples")
example 1

example 1

Volcano plot

After process the differential gene expression analysis (Ex. running the run_DE_analysis.pl from Trinity framework) we can improve the data visualization as follow:

DE <- read.table(paste0(dir,"RSEM.isoform.counts.matrix.Female_vs_Undiff.edgeR.DE_results"))
library(ggplot2)
library(scales)

p <- ggplot(DE, aes(x=logFC, y=-log10(PValue))) + geom_point()
p
This is a Volcano plot

This is a Volcano plot

Let’s label by Fold Change (up/down) rate and significancy by the follow condition:

  • log Fold Change (logFC) > abs(2)
  • False Discovery Rate (FDR) < 0.05
  • Both statements: FDR < 0.05 and logFC > abs(2)
# logf>abs(2) fdr < 0.05 fdr < 0.05 and logfc> abs(2)
DE$Sig <- "Non Sig or basal"
DE$Sig[(abs(DE$logFC) > 2)] <- "logFC"
DE$Sig[(DE$FDR<0.05) & (abs(DE$logFC)>2)] <- "logFC_FDR"

p <- ggplot(DE, aes(x=logFC, y=-log10(PValue))) +
    geom_point(aes(color = Sig)) + theme_classic() +
    # RColorBrewer::display.brewer.all()
    scale_color_brewer()
p
Volcano; Color and fill

Volcano; Color and fill

Add lines and axis name:

p1 <- p +
        geom_vline(xintercept = 0) +
        geom_hline(yintercept = 0) +
        geom_hline(yintercept = -log10(0.0001), linetype = "dashed") +
        geom_vline(xintercept = c(-2, 2), linetype = "dashed")

And also let’s rename x and y axis using backquote macros:

p1 + xlab(bquote(~Log[2]~ "fold change")) + ylab(bquote(~-Log[10]~italic(P)))
Volcano lines and axis labeled

Volcano lines and axis labeled

# p2 <- p + xlab("Fold change (log2)") + ylab("P-Value")

Finally, let’s label the dots from the scatter:

library(ggrepel)
maxlab <- max(-log10(DE$PValue)) - 1 # select the points below the highest -log10(PValue) value to label

p2 <-  p1 + geom_text_repel(
          data = subset(DE, -log10(PValue) > maxlab),
          aes(label = Sig),
          size = 2.5,
          box.padding = unit(0.2, "lines"),
          point.padding = unit(0.2, "lines") 
  )

Finally, let’s compare the layouts:

library(gridExtra)
library(grid)
grid_arrange_shared_legend(p,p1,p2, ncol = 3)
Volcano comparison

Volcano comparison

Functional and enrichment annotation

After assembly denovo or guided transcriptome is common to map each contig to a reference in order to annotate the potential source of each transcript. In this view, blast2go is the average tool used by users to perform this analysis nevertheles blast2go is a non free tool. In spite of, Trinotate is a useful free-framework to this purpose. Trinotate makes use of a number of different well referenced methods for functional annotation including homology search to known sequence data (BLAST+/SwissProt), protein domain identification (HMMER/PFAM), protein signal peptide and transmembrane domain prediction (signalP/tmHMM), and leveraging various annotation databases (eggNOG/GO/Kegg databases).

In personal experience, trinotate had resulted in better performance (such less computational time) than the blast2go counterpart and useful to automatically annotate one workflow at time.

In addition, trinotateR is package developed by Chris Stubben with useful functions to “wrangling” the tab-delimited Trinotate.xls result.

library(trinotateR)
## Loading required package: data.table
y <- read_trinotate(paste0(dir,"Trinotate.xls"))
knitr::kable(summary_trinotate(y))
unique total
gene_id 61694 148534
transcript_id 147454 148534
prot_id 55785 55785
prot_coords 24521 55785
sprot_Top_BLASTX_hit 43800 45692
gene_ontology_blast 14000 44709
Kegg 15112 39430
eggnog 5651 37535
sprot_Top_BLASTP_hit 28519 35660
Pfam 26217 34005
gene_ontology_pfam 1775 20111
TmHMM 7764 10393
RNAMMER 9 9
SignalP 0 0
transcript 0 0
peptide 0 0

Most of the annotations contain mutliple hits in a backtick-delimited list and each hit contains multiple fields in a caret-delimited list. For example, the second Pfam annotation below contains two hits and each hit contains a pfam id, symbol, name, alignment and e-value. The split_pfam functions splits multiple hits and fields, so the second Pfam annotation is now printed in rows 2 and 3 below.

y1 <- split_pfam(y)
## 70856 Pfam annotations
head(y1,3)[,c(2,4:7)]
##                  transcript    pfam         symbol
## 1: TRINITY_DN10004_c0_g1_i1 PF01442 Apolipoprotein
## 2: TRINITY_DN10005_c0_g1_i1 PF16489           GAIN
## 3: TRINITY_DN10007_c0_g1_i1 PF00135     COesterase
##                                           name  align
## 1:               Apolipoprotein A1/A4/E domain  20-67
## 2: GPCR-Autoproteolysis INducing (GAIN) domain 17-121
## 3:                     Carboxylesterase family   3-85

Finally, the summary_pfam lists both, the number of unique Pfam identifiers and the total genes, transcripts and proteins with a Pfam annotation.

y2 <- summary_pfam(y1)
## 4849 rows
knitr::kable(head(y2[order(-y2$transcripts),]))
pfam symbol name genes transcripts proteins total
PF00386 C1q C1q domain 201 801 809 856
PF00069 Pkinase Protein kinase domain 279 594 594 611
PF07714 Pkinase_Tyr Protein tyrosine kinase 272 583 583 596
PF00400 WD40 WD domain, G-beta repeat 203 461 461 1351
PF12796 Ank_2 Ankyrin repeats (3 copies) 219 455 455 925
PF00023 Ank Ankyrin repeat 214 449 449 1031

The summary table also includes a count attribute with the number of unique genes, transcripts and proteins with a Pfam annotation, as well as the total number of annotations. In this example, there are 33,721 unique transcripts with a Pfam annotation and 56,642 total annotations to transcripts (since those may have more than one Pfam annotation).

c <- attr(y2, "count")
print(c)
##             unique annotations
## Pfam          4849       70856
## genes        14417       25280
## transcripts  33721       56642
## proteins     34005       56757

Implement datatable to present interactive data-table: from all the available genes/transcripts annotations.

## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

Using trinotateR to get the Gene Ontology annotation.

A GO annotation is a statement about the function of a particular gene. Each GO annotation consists of an association between a gene and a GO term. Together, these statements comprise a “snapshot” of current biological knowledge. The GO describes function with respect to three aspects: molecular function (molecular-level activities performed by gene products), cellular component (the locations relative to cellular structures in which a gene product performs a function), and biological process (the larger processes, or ‘biological programs’ accomplished by multiple molecular activities)

Reference: http://www.geneontology.org/page/ontology-documentation, 2018

go <- split_GO(y)
## 473221 gene_ontology_blast annotations
gos <- summary_GO(go)
## 15386 rows
head(gos[order(-gos$transcripts),])
##            go           ontology                           name genes
## 1: GO:0005737 cellular_component                      cytoplasm  5087
## 2: GO:0005634 cellular_component                        nucleus  4286
## 3: GO:0016021 cellular_component integral component of membrane  3839
## 4: GO:0005829 cellular_component                        cytosol  3508
## 5: GO:0005886 cellular_component                plasma membrane  3366
## 6: GO:0046872 molecular_function              metal ion binding  2771
##    transcripts proteins total
## 1:       10521     8155 10550
## 2:        8687     6947  8725
## 3:        7817     6020  7832
## 4:        7300     5854  7325
## 5:        6305     4961  6315
## 6:        5815     4421  5829

And also determine the counts

cgo <- attr(gos, "count")
print(cgo)
##             unique annotations
## GO           15386      473221
## genes        19818      241902
## transcripts  44415      472268
## proteins     34463      379068

Using only the translated transcripts to get the differential Expressed annotations:

got <- na.omit(go, cols = "protein")
x$transcript <- rownames(x)
m <- merge(got, x, suffix = c("transcript"), all=FALSE)
m <- m[order(m$transcript),c(1,4,5:12)]

library(DT)
datatable(m , escape=1, options = list( pageLength = 10 ) )

Let’s draw some visualizations from the annotation enrichment throught a the transcriptome assembly:

library(ggpubr)
## Loading required package: magrittr
plotgos <- head(gos[order(-gos$transcripts),], 80)
ggbarplot(plotgos, "name", "transcripts",
          fill = "ontology", 
          color = "ontology",
   palette = c("#00AFBB", "#E7B800", "#FC4E07")) +
   theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 7))
Gene Ontology enrichment plot (top 80 transcripts)

Gene Ontology enrichment plot (top 80 transcripts)

   # facet_grid(ontology ~ .,) + theme(#strip.text.x = element_blank(), 
   #                              axis.text.x = element_text(angle = 90, hjust = 1))

And separate the ontology annotation for further analysis:

Var1 Freq
biological_process 10236
cellular_component 1571
molecular_function 3579

You can separete the go terms to perfom further test:

MF <- gos[gos$ontology=="molecular_function",]
CC <- gos[gos$ontology=="cellular_component",]
BP <- gos[gos$ontology=="biological_process",]

Semantic similarity (of GO terms)

Determine the similarity of two GO terms based on the annotation statistics of their common ancestor terms by computing semantic similarity among GO terms, sets of GO terms, gene products, and gene clusters, providing different methods than measure the information content (IC). To details please read the Wang’s paper published in Oxford.

First, build annotation data needed by GOSemSim via godata function. Based in figure from Gene Ontology Enrichment, we could focus on the Cellular component (CC) terms.

library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind,
##     colMeans, colnames, colSums, dirname, do.call, duplicated,
##     eval, evalq, Filter, Find, get, grep, grepl, intersect,
##     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
##     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
##     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which, which.max,
##     which.min
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:data.table':
## 
##     first, second
## The following object is masked from 'package:base':
## 
##     expand.grid
## 
## Attaching package: 'IRanges'
## The following object is masked from 'package:data.table':
## 
##     shift
## 
hsGO <- GOSemSim::godata('org.Hs.eg.db', ont="CC", computeIC=FALSE)
## 
## preparing gene to GO mapping data...
go <- as.vector(CC$go)
go1 <- sample(go, 20)
go2 <- sample(go, 20)
gosim <- GOSemSim::mgoSim(go1, go2, semData=hsGO, measure="Wang", combine=NULL)

And visualize the similarity of the GO term.

superheat(gosim,
          # retain original order of rows/cols
          pretty.order.rows = TRUE,
          pretty.order.cols = TRUE,
          #left.label = "none",
          bottom.label.text.angle = 90,
          row.title = "Sample 1",
          column.title = "Sample 2",
          bottom.label.text.size = 4,
          left.label.text.size = 4
          )

Finally detache (unload) org.Hs.eg.db package.

detach(package:org.Hs.eg.db, unload = TRUE)

Blast hits annotation

From the annotation, lets use the blastx result in order to get the lineage from proteins anntotated.

blast <- split_blast2(y)
## 63540 sprot_Top_BLASTX_hit annotations
blast2 <- summary_blast(blast)
## 17039 rows
knitr::kable(head(blast2[order(-blast2$transcripts),]))
uniprot domain genus name genes transcripts proteins total
HS12B_HUMAN Eukaryota Homo Heat shock 70 kDa protein 12B 60 136 121 146
C1QL4_MOUSE Eukaryota Mus Complement C1q-like protein 4 47 122 106 124
HS12A_MOUSE Eukaryota Mus Heat shock 70 kDa protein 12A 51 113 90 116
HS12A_HUMAN Eukaryota Homo Heat shock 70 kDa protein 12A 53 99 64 99
PLCL_MYTGA Eukaryota Mytilus Perlucin-like protein 38 98 69 98
PLC_HALLA Eukaryota Haliotis Perlucin 34 97 81 100

Taxonomic tree:

Visualizing hierarchical information ### Metacoder Metacoder.

Let’s subset the virus lineage annotation:

vs <- blast[blast$domain == "Viruses",]
list <- vs$lineage

And plot …

library(metacoder)
## Loading required package: taxa
## 
## Attaching package: 'taxa'
## The following object is masked from 'package:ggplot2':
## 
##     map_data
## This is metacoder verison 0.2.1.9008 (development version). If you use metacoder for published research, please cite our paper:
## 
## Foster Z, Sharpton T and Grunwald N (2017). "Metacoder: An R package for visualization and manipulation of community taxonomic diversity data." PLOS Computational Biology, 13(2), pp. 1-15. doi: 10.1371/journal.pcbi.1005404
## 
## Enter `citation("metacoder")` for a BibTeX entry for this citation.
## 
## Attaching package: 'metacoder'
## The following object is masked from 'package:IRanges':
## 
##     reverse
library(RColorBrewer)
obj <- parse_tax_data(vs, class_cols = "lineage", class_sep = ";")
display.brewer.pal(n = 5, name = 'Set3') # Hexadecimal color specification
heat_tree(obj, 
          node_label = taxon_names,
          node_size = n_obs,
          node_color = n_obs,
          #edge_color = "grey",
          node_color_range = brewer.pal(n = 10, name = "Oranges"))
## Warning in brewer.pal(n = 10, name = "Oranges"): n too large, allowed maximum for palette Oranges is 9
## Returning the palette you asked for with that many colors
Tree example using metacoder

Tree example using metacoder

Tree example using metacoder

Tree example using metacoder

Which annotation could be true, let’s show the identity ?

heat_tree(obj, 
          edge_label = taxon_names,
          edge_label_size = 0.1,
          node_color = identity,
          edge_color = "grey",
          node_color_range = brewer.pal(n = 10, name = "Oranges"),
          node_size_axis_label = "n_obs",
          node_color_axis_label = "Identity")
## Warning in brewer.pal(n = 10, name = "Oranges"): n too large, allowed maximum for palette Oranges is 9
## Returning the palette you asked for with that many colors
## Warning: NAs found in the "node_color" option. These may cause plotting errors or other strange behavior. NAs can be created if there is not a value for each taxon. The following 28 of 48 taxa have NAs for the "node_color" option:
##   ab, ac, ad, ae, af, ag, ah, ai ... ba, bb, bh, bj, bm, bn, bo
Tree example 2; labeling by identity

Tree example 2; labeling by identity

ggtree

The ggtree is designed by extending the ggplot2 (Wickham 2009) package. It is based on the grammar of graphics and takes all the good parts of ggplot2. ggtree supports several layouts, including rectangular, slanted, circular and fan for phylogram and cladogram, equal_angle and daylight for unrooted layout, time-scaled and two dimentional phylogenies.

Reference: Yu et al. 2017, https://besjournals.onlinelibrary.wiley.com/doi/abs/10.1111/2041-210X.12628

tbl <- head(vs[order(-vs$identity),], 50)
tbl <- tbl[!duplicated(tbl$transcript),] # remove duplicates annotations
write.table(tbl$transcript, file = "virus.list", quote = FALSE, col.names = FALSE, row.names = FALSE)

Run in clustalo web-service and input again in r the tree this is a Neighbour-joining tree without distance corrections.

tree <- treeio::read.newick(paste0(dir,"virus.tree.corrected.txt"))

And plot

library(ggtree)
## ggtree v1.12.0  For help: https://guangchuangyu.github.io/software/ggtree
## 
## If you use ggtree in published research, please cite:
## Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy Tsan-Yuk Lam. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 2017, 8(1):28-36, doi:10.1111/2041-210X.12628
## 
## Attaching package: 'ggtree'
## The following object is masked from 'package:IRanges':
## 
##     collapse
## The following object is masked from 'package:S4Vectors':
## 
##     expand
## The following object is masked from 'package:ggpubr':
## 
##     rotate
## The following object is masked from 'package:magrittr':
## 
##     inset
multiplot(
        ggtree(tree, branch.length = "none") + geom_treescale(width=0.4),
        ggtree(tree, branch.length='none', layout="daylight") + geom_treescale(width=0.4) + geom_nodepoint(),
        ggtree(tree, branch.length='none', layout='circular') + geom_treescale(width=0.4),
        ncol=3,labels = LETTERS[1:3])
## Average angle change [1] 0.099133807614635
## Average angle change [2] 0.0415093787405126

Then

p <-  ggtree(tree, branch.length='none', layout='circular') + geom_treescale(width=0.4) +
      geom_nodepoint(color="#b5e521", alpha=1/4, size=4) 

Modify tips and node geometry

multiplot(# add node points
        p + geom_nodepoint(),
        # add tip points
        p + geom_tippoint(),
        # Label the tips
        p + geom_tiplab(),
        ncol=3,labels = LETTERS[1:3])

Use annotation details to merge information within object (ie. tree and dataframe)

d <- p + geom_nodepoint()

tbl <- tbl[,-c(1)]
colnames(tbl)[1] <- "label"
tbl$genus <- gsub("unclassified ", "", tbl$genus)


t1 <- d %<+% tbl + 
    geom_tiplab(size=2.5, aes(label=paste0('bolditalic(', uniprot, ')')), parse=TRUE) +   
    geom_point(aes(color=genus), size=5, alpha=.5, na.rm = TRUE) + theme(legend.position="bottom")

t2 <- p %<+% tbl + 
    geom_tiplab(size=2.5, aes(label=paste0('italic(', uniprot, ')')), parse=TRUE)

multiplot(t1, t2, ncol=2)